Dynamic Warp Formation: Exploiting Thread Scheduling for Efficient MIMD Control Flow on SIMD Graphics Hardware
نویسنده
چکیده
Recent advances in graphics processing units (GPUs) have resulted in massively parallel hardware that is easily programmable and widely available in commodity desktop computer systems. GPUs typically use single-instruction, multiple-data (SIMD) pipelines to achieve high performance with minimal overhead for control hardware. Scalar threads running the same computing kernel are grouped together into SIMD batches, sometimes referred to as warps. While SIMD is ideally suited for simple programs, recent GPUs include control flow instructions in the GPU instruction set architecture and programs using these instructions may experience reduced performance due to the way branch execution is supported by hardware. One solution is to add a stack to allow different SIMD processing elements to execute distinct program paths after a branch instruction. The occurrence of diverging branch outcomes for different processing elements significantly degrades performance using this approach. In this thesis, we propose dynamic warp formation and scheduling, a mechanism for more efficient SIMD branch execution on GPUs. It dynamically regroups threads into new warps on the fly following the occurrence of diverging branch outcomes. We show that a realistic hardware implementation of this mechanism improves performance by an average of 47% for an estimated area increase of 8%.
منابع مشابه
Improved GPU SIMD control flow efficiency via hybrid warp size mechanism
High single instruction multiple data (SIMD) efficiency and low power consumption have made graphic processing units (GPUs) an ideal platform for many complex computational applications. Thousands of threads can be created by programmers and grouped into fixed-size SIMD batches, known as warps. High throughput is then achieved by concurrently executing such warps with minimal control overhead. ...
متن کاملA software-based dynamic-warp scheduling approach for load-balancing the Viola-Jones face detection algorithm on GPUs
Face detection is a key component in applications such as security surveillance and human–computer interaction systems, and real-time recognition is essential in many scenarios. The Viola–Jones algorithm is an attractive means of meeting the real time requirement, and has been widely implemented on custom hardware, FPGAs and GPUs. We demonstrate a GPU implementation that achieves competitive pe...
متن کاملAsynchronous Problems on SIMD Parallel Computers
One of the essential problems in parallel computing is: can SIMD machines handle asynchronous problems? This is a di cult, unsolved problem because of the mismatch between asynchronous problems and SIMD architectures. We propose a solution to let SIMD machines handle general asynchronous problems. Our approach is to implement a runtime support system which can run MIMD-like software on SIMD har...
متن کاملExploiting mixed-mode parallelism for matrix operations on the HERA architecture through reconfiguration
Recent advances in multi-million-gate platform FPGAs have made it possible to design and implement complex parallel systems on a programmable chip (PSOPCs) that also incorporate hardware floating-point units (FPUs). These options take advantage of resource reconfiguration. In contrast to the majority of the FPGA community that still employs reconfigurable logic to develop algorithm-specific cir...
متن کاملSimplified vector-thread architectures for flexible and efficient data-parallel accelerators
This thesis explores a new approach to building data-parallel accelerators that is based on simplifying the instruction set, microarchitecture, and programming methodology for a vector-thread architecture. The thesis begins by categorizing regular and irregular data-level parallelism (DLP), before presenting several architectural design patterns for data-parallel accelerators including the mult...
متن کامل